An Analysis of 2020 Presidential Campaign Speeches
2025-05-06
Is there a correlation between aggressivity and rhetorical complexity in Donald Trump’s 2020 presidential campaign speeches?
Let’s look at a comparison between Obama and Trump.
Chalkiadakis, Ioannis, Louise Anglès d’Auriac, Gareth Peters, and Divina Frau-Meigs. A text dataset of campaign speeches of the main tickets in the 2020 US presidential election (September 20, 2024).
\[\text{Aggression Ratio} = \frac{\text{Number of Aggressive Words}}{\text{Total Number of Words}}\]
Visualizing the subset of the 21 most aggressive speeches: those with an aggression ratio above the 75th percentile (0.206258).
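The 0.206258 cutoff was presumably obtained as the 75th percentile of the per-speech aggression ratios, e.g. via pandas' `quantile`. A minimal sketch with toy values standing in for the real `neg_ratio` column:

```python
import pandas as pd

# Toy aggression ratios; the real values come from Trumpdf["neg_ratio"]
ratios = pd.Series([0.10, 0.20, 0.30, 0.40])

# 75th-percentile threshold (pandas' default linear interpolation)
threshold = ratios.quantile(0.75)

# Keep only speeches above the threshold
top_quartile = ratios[ratios > threshold]
```

For this toy series the threshold is 0.325 and only the 0.40 speech survives; on the full corpus the same call yields the 0.206258 cutoff and the 21-speech subset.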
Flesch-Kincaid Reading Ease
Speeches with a flesch_score above 68.72 (the 75th percentile) are Trump’s simplest speeches; this subset contains 59 of 235 total speeches.
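For reference, the Flesch Reading Ease score (computed via textstat in the code below) follows the formula 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words); higher scores mean simpler text. A crude self-contained approximation, with a vowel-group heuristic standing in for textstat's syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per contiguous vowel group
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease_approx(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: penalizes long sentences and polysyllabic words
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)

simple = flesch_reading_ease_approx("The cat sat on the mat.")
complex_ = flesch_reading_ease_approx(
    "Extraordinarily complicated institutional considerations necessitate comprehensive deliberation."
)
```

The short monosyllabic sentence scores far above the polysyllabic one, which is the contrast the 68.72 threshold exploits.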
Latent Dirichlet Allocation (LDA) is used to identify the top topics among the speeches in the 75th percentile of aggression.
LatentDirichletAllocation(n_components=11, random_state=42)
Topic #1: know peopl said want dont say great thing right think
Topic #2: iran terror nation countri futur year want state sign think
Topic #3: act nation decis presid militari secur administr author section abil
Topic #4: presid trump biden want elect said vote peopl pennsylvania dont
Topic #5: iran unit nuclear iranian regim world state sanction missil weapon
Topic #6: thank american unit viru state flag action world nation peopl
Topic #7: woman appoint holocaust famili day ensur busi unit state american
Topic #8: border countri law immigr illeg secur mexico unit state year
Topic #9: american china peopl hong kong cancer year state world unit
Topic #10: race sex order agenc feder state shall individu train child
Topic #11: american america nation countri year thank state great peopl world
Training LDA model for 2 topics...
Number of topics: 2, Coherence Score: 0.41836909439474695
Training LDA model for 5 topics...
Number of topics: 5, Coherence Score: 0.40381606014590804
Training LDA model for 8 topics...
Number of topics: 8, Coherence Score: 0.4081563907379523
Training LDA model for 11 topics...
Number of topics: 11, Coherence Score: 0.4211153278595033
Training LDA model for 14 topics...
Number of topics: 14, Coherence Score: 0.4059432468988091
Training LDA model for 17 topics...
Number of topics: 17, Coherence Score: 0.42100164166919735
Training LDA model for 20 topics...
Number of topics: 20, Coherence Score: 0.4120166602889251
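The coherence scores above were presumably computed with a library such as gensim's CoherenceModel. As a dependency-free illustration of what topic coherence measures, here is a sketch of the UMass variant (a different but related measure), which rewards topic words that co-occur in the same documents:

```python
import math
from collections import defaultdict
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs, eps=1.0):
    """UMass coherence: sum over word pairs of log((D(w_i, w_j) + eps) / D(w_j))."""
    doc_freq = defaultdict(int)  # D(w): number of documents containing w
    co_freq = defaultdict(int)   # D(w_i, w_j): documents containing both words
    for doc in tokenized_docs:
        present = set(doc)
        for w in present:
            doc_freq[w] += 1
        for wi, wj in combinations([w for w in topic_words if w in present], 2):
            co_freq[(wi, wj)] += 1
            co_freq[(wj, wi)] += 1
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            if doc_freq[wj]:
                score += math.log((co_freq[(wi, wj)] + eps) / doc_freq[wj])
    return score

docs = [["border", "wall", "mexico"], ["border", "wall"],
        ["iran", "nuclear"], ["border", "iran"]]
coherent = umass_coherence(["border", "wall"], docs)     # words that co-occur
incoherent = umass_coherence(["wall", "nuclear"], docs)  # words that never co-occur
```

Words that frequently appear together score higher, which is the basis for picking the topic count with the best coherence.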
LatentDirichletAllocation(n_components=11, random_state=42)
Topic #1: peopl think thing meet good lot know number big countri
Topic #2: thank peopl countri want know great theyr think dont said
Topic #3: dont want know said peopl say year theyr think right
Topic #4: crowd number great happen mani weve thank big come tremend
Topic #5: said think approv thing peopl militari magazin want know make
Topic #6: presid trump said know want dont peopl year say great
Topic #7: lawn mcconnel south spike trip juli phenomen mildli review staff
Topic #8: percent charg countri think lot deal great tariff trade happen
Topic #9: lawn mcconnel south spike trip juli phenomen mildli review staff
Topic #10: meet mildli ohio pretti question land new hour folk coupl
Topic #11: peopl thank great know said laughter right like year american
Training LDA model for 2 topics...
Number of topics: 2, Coherence Score: 0.4120166602889251
Training LDA model for 5 topics...
Number of topics: 5, Coherence Score: 0.4120166602889251
Training LDA model for 8 topics...
Number of topics: 8, Coherence Score: 0.4120166602889251
Training LDA model for 11 topics...
Number of topics: 11, Coherence Score: 0.4120166602889251
Training LDA model for 14 topics...
Number of topics: 14, Coherence Score: 0.4120166602889251
Training LDA model for 17 topics...
Number of topics: 17, Coherence Score: 0.4120166602889251
Training LDA model for 20 topics...
Number of topics: 20, Coherence Score: 0.4120166602889251
A more diverse selection of documents (tweets, statements made on social media, and transcriptions of video clips)
Comparative analysis of the 2020 versus 2024 presidential campaigns: how has Trump’s rhetoric changed over time?
Using LLMs, e.g. via the OpenAI API, to code aggression instead of the dictionary method
Comparing politicians from different parties (Republican versus Democrat)
Monthly Average Aggression Ratio
american_words = [
"abuse", "abysmal", "accusation", "accusations", "accuse", "accusing", "adversarial",
"aggressive", "anger", "angered", "annoyance", "annoyed", "annoying", "antagonistic",
"antagonize", "appalling", "archaic", "arrogance", "arrogant", "ashamed", "assault",
"assaulted", "assaulting", "attacking", "atrocious", "backtalk", "bitter", "bitterly",
"bitterness", "blackened", "blackmail", "blame", "blamed", "blaming", "blunder", "bogus",
"botch", "botched", "betray", "betrayed", "betrayal", "clownery", "chaos", "chaotic",
"complain", "complaining", "condemn", "confront", "confrontation", "confrontational",
"crass", "coward", "cowardly", "criticize", "criticized", "criticizing", "cruel", "cruelty",
"debase", "debased", "deceit", "deceived", "deceive", "deception", "devious", "deviousness",
"despicable", "disgrace", "disgraceful", "disgusting", "dishonest", "dishonorable",
"disregard", "disreputable", "distasteful", "dodgy", "dull", "embarrass", "embarrassing",
"embarrassment", "fabricator", "fail", "failed", "failure", "failures", "faithless", "farcical",
"fiasco", "fibber", "fiddle", "fiddled", "fool", "foolish", "fraud", "fraudulence",
"fraudulent", "furious", "gimmick", "good-for-nothing", "groan", "grotesque", "hackery",
"half-truths", "hate", "hatred", "hodgepodge", "horrendous", "hostile", "hostility",
"humiliate", "humiliating", "hypocrisy", "hypocrite", "idiot", "idiotic", "ignorance",
"ignorant", "ill-judged", "ill-mannered", "immoral", "inadequacy", "incapable", "inferior",
"insult", "insulted", "insulting", "intolerant", "ironic", "irony", "irritated", "jumble",
"laughable", "lawbreakers", "leech", "libelous", "ludicrous", "mess", "misbehave", "mischief",
"mischievous", "mislead", "misleading", "needless", "needlessly", "neglect", "neglected",
"neglectful", "negligent", "nonsense", "nonsensical", "nasty", "obnoxious", "offend",
"offenders", "outrageous", "outraged", "patronize", "patronizing", "petty", "penny-pinching",
"phony", "petulant", "prejudice", "prejudices", "predictable", "problematic", "provoke",
"provoked", "ridicule", "ridiculous", "reprehensible", "rude", "scandal", "scandalous",
"scapegoat", "scapegoats", "scaremonger", "scaremongering", "shady", "shameful", "shambles",
"sham", "shenanigans", "short-sighted", "silly", "silliness", "slander", "slanderous",
"sleaze", "sleazy", "sly", "slyness", "smokescreen", "sneaky", "spite", "spiteful", "steal",
"stereotyping", "stubborn", "stupid", "stupidity", "subterfuge", "swindling", "tactic",
"talking back", "trick", "trickery", "unacceptable", "unhelpful", "unnatural", "untrue",
"undermine", "outrageous", "vindictive", "villain", "woeful", "wrong"
]
import pandas as pd
import json
# Path to your file
file_path = '/Users/KaylaMuller/desktop/text_analysis/week12/cleantext_DonaldTrump.jsonl.txt'
# Read the file line by line and parse each line as JSON
data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
# Turn into a DataFrame
Trumpdf = pd.DataFrame(data)
import pandas as pd
import re
# Make sure your list of words is defined
word_list = set(american_words)
# Compile a regex pattern that matches any of the words, word-boundary safe
pattern = re.compile(r'\b(' + '|'.join(re.escape(word) for word in word_list) + r')\b', re.IGNORECASE)
# Apply a function to count matches in each row
Trumpdf["NegativeWordCount"] = Trumpdf["CleanText"].astype(str).apply(lambda text: len(pattern.findall(text)))
Trumpdf["TotalWordCount"] = Trumpdf["CleanText"].astype(str).apply(lambda text: len(re.findall(r'\b\w+\b', text)))
Trumpdf["neg_ratio"] = Trumpdf["NegativeWordCount"] / Trumpdf["TotalWordCount"] * 100
# Ensure the 'Date' column is in datetime format
Trumpdf["Date"] = pd.to_datetime(Trumpdf["Date"], errors="coerce")
# Drop rows where 'Date' is NaT (invalid dates)
Trumpdf = Trumpdf.dropna(subset=["Date"])
# Extract YearMonth in string format (YYYY-MM) for easier handling in ggplot
Trumpdf["YearMonth"] = Trumpdf["Date"].dt.to_period('M').astype(str)
# Calculate the average 'neg_ratio' by 'YearMonth'
monthly_avg_neg_ratio = Trumpdf.groupby("YearMonth")["neg_ratio"].mean().reset_index()
# Export the result to CSV for use in R
monthly_avg_neg_ratio.to_csv("monthly_avg_neg_ratio.csv", index=False)
library(reticulate)
library(ggplot2)
# Load the CSV file (make sure you have the correct path to the file)
df <- read.csv("monthly_avg_neg_ratio.csv")
# Convert 'YearMonth' to a date format
df$YearMonth <- as.Date(paste0(df$YearMonth, "-01"))
# Plot the data
ggplot(df, aes(x = YearMonth, y = neg_ratio)) +
geom_line() +
labs(title = "Monthly Average Aggression Ratio", x = "Month", y = "Aggression Ratio (%)") +
theme_minimal()
# Sort by 'YearMonth' to ensure the rolling average works correctly
monthly_avg_neg_ratio = monthly_avg_neg_ratio.sort_values("YearMonth")
# Calculate the two-month rolling average of 'neg_ratio'
monthly_avg_neg_ratio["TwoMonthRollingAvg"] = monthly_avg_neg_ratio["neg_ratio"].rolling(window=2).mean()
# Export the result to CSV for use in R
monthly_avg_neg_ratio.to_csv("monthly_avg_neg_ratio_with_rolling_avg.csv", index=False)
library(ggplot2)
library(readr)
library(dplyr)
# Read the data
monthly_avg_neg_ratio <- read_csv("monthly_avg_neg_ratio_with_rolling_avg.csv")
# Convert YearMonth to Date type
monthly_avg_neg_ratio <- monthly_avg_neg_ratio %>%
mutate(Date = as.Date(paste0(YearMonth, "-01")))
# Plot with ggplot
ggplot(monthly_avg_neg_ratio, aes(x = Date)) +
geom_line(aes(y = neg_ratio), color = "blue", linetype = "dashed", size = 1) +
geom_line(aes(y = TwoMonthRollingAvg), color = "red", size = 1) +
labs(title = "Monthly Negative Ratio with Two-Month Rolling Average",
x = "Date",
y = "Negative Ratio (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_date(date_labels = "%Y-%m", date_breaks = "1 month")
Analysis of Aggression in the 75th Percentile
# Subset the DataFrame to select only rows where 'neg_ratio' > 0.206258
subset_df = Trumpdf[Trumpdf["neg_ratio"] > 0.206258]
# Calculate the average 'neg_ratio' by 'YearMonth'
subset_monthly_avg_neg_ratio = subset_df.groupby("YearMonth")["neg_ratio"].mean().reset_index()
# Export the result to CSV for use in R
subset_monthly_avg_neg_ratio.to_csv("monthly_avg_neg_ratio.csv", index=False)
library(reticulate)
library(ggplot2)
# Load the CSV file (make sure you have the correct path to the file)
df_with_subset <- read.csv("monthly_avg_neg_ratio.csv")
# Convert 'YearMonth' to a date format
df_with_subset$YearMonth <- as.Date(paste0(df_with_subset$YearMonth, "-01"))
# Plot the data
ggplot(df_with_subset, aes(x = YearMonth, y = neg_ratio)) +
geom_line() +
labs(title = "Monthly Average Aggression Ratio for the 75th percentile", x = "Month", y = "Aggression Ratio (%)") +
theme_minimal()
Monthly Average Flesch Score
from textstat import flesch_reading_ease
Trumpdf['flesch_score'] = Trumpdf['CleanText'].apply(flesch_reading_ease)
# Calculate the average 'flesch_score' by 'YearMonth'
monthly_avg_flesch_score = Trumpdf.groupby("YearMonth")["flesch_score"].mean()
# Export the result to CSV for use in R
monthly_avg_flesch_score.to_csv("monthly_avg_flesch_score.csv", index=True)
library(ggplot2)
library(readr)
library(dplyr)
# Read the data
monthly_avg_flesch_score <- read_csv("monthly_avg_flesch_score.csv")
# Convert YearMonth to Date type
monthly_avg_flesch_score <- monthly_avg_flesch_score %>%
mutate(Date = as.Date(paste0(YearMonth, "-01")))
# Plot with ggplot
ggplot(monthly_avg_flesch_score, aes(x = Date)) +
geom_line(aes(y = flesch_score), color = "blue", size = 1) +
labs(title = "Monthly Average Flesch Score",
x = "Date",
y = "Flesch Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_date(date_labels = "%Y-%m", date_breaks = "1 month")
Analysis of Flesch Score Above the 75th Percentile
# Subset the DataFrame to select only rows where 'flesch_score' > 68.72
subset_df_flesch_score = Trumpdf[Trumpdf["flesch_score"] > 68.72]
# Calculate the average 'neg_ratio' by 'YearMonth'
subset_monthly_avg_flesch_score = subset_df_flesch_score.groupby("YearMonth")["flesch_score"].mean().reset_index()
# Export the result to CSV for use in R
subset_monthly_avg_flesch_score.to_csv("subset_monthly_avg_flesch_score.csv", index=False)
library(ggplot2)
library(readr)
library(dplyr)
# Read the data
subset_monthly_avg_flesch_score <- read_csv("/Users/KaylaMuller/Desktop/text_analysis/week12/subset_monthly_avg_flesch_score.csv")
# Convert YearMonth to Date type
subset_monthly_avg_flesch_score <- subset_monthly_avg_flesch_score %>%
mutate(Date = as.Date(paste0(YearMonth, "-01")))
# Plot with ggplot
ggplot(subset_monthly_avg_flesch_score, aes(x = Date)) +
geom_line(aes(y = flesch_score), color = "blue", size = 1) +
labs(title = "Monthly Average Flesch Score for the 75th Percentile",
x = "Date",
y = "Flesch Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_date(date_labels = "%Y-%m", date_breaks = "1 month")
Topic Modeling: Aggression in the 75th Percentile
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
# Step 0: Optional — Make a copy to avoid SettingWithCopyWarning
subset_df = subset_df.copy()
# Setup
stop = set(stopwords.words('english'))
stop.add('applause') # custom stopword
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
# Combined cleaning function
def clean_text(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\d+', '', text)                                   # remove numbers
    tokens = word_tokenize(text)                                      # tokenize
    tokens = [word for word in tokens if word not in stop]            # remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]          # lemmatization
    tokens = [stemmer.stem(word) for word in tokens]                  # stemming
    return ' '.join(tokens)
# Apply to DataFrame
subset_df['CleanText_transformed'] = subset_df['CleanText'].apply(clean_text)
WordClouds Representing Top Topics: AGGRESSION
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Loop over each topic
for topic_idx, topic_weights in enumerate(lda.components_):
    # Create dictionary: word -> weight for the top 30 words
    word_freq = {feature_names[i]: topic_weights[i] for i in topic_weights.argsort()[:-31:-1]}
    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(f"Topic #{topic_idx + 1}")
    plt.show()
Topic Modeling: Simplicity in the 75th Percentile
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
# Step 0: Optional — Make a copy to avoid SettingWithCopyWarning
subset_df_flesch_score = subset_df_flesch_score.copy()
# Setup
stop = set(stopwords.words('english'))
stop.add('applause') # custom stopword
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
# Combined cleaning function
def clean_text(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\d+', '', text)                                   # remove numbers
    tokens = word_tokenize(text)                                      # tokenize
    tokens = [word for word in tokens if word not in stop]            # remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]          # lemmatization
    tokens = [stemmer.stem(word) for word in tokens]                  # stemming
    return ' '.join(tokens)
# Apply to DataFrame
subset_df_flesch_score['CleanText_transformed'] = subset_df_flesch_score['CleanText'].apply(clean_text)
WordClouds Representing Top Topics: SIMPLICITY
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Loop over each topic
for topic_idx, topic_weights in enumerate(lda2.components_):
    # Create dictionary: word -> weight for the top 30 words
    word_freq = {feature_names[i]: topic_weights[i] for i in topic_weights.argsort()[:-31:-1]}
    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(f"Topic #{topic_idx + 1}")
    plt.show()
Muller (JCU): Aggression and Complexity